The persistent homology of genealogical networks

Genealogical networks (i.e. family trees) are of growing interest, with the largest known data sets now including well over one billion individuals. Interest in family history also supports an 8.5 billion dollar industry whose size is projected to double within 7 years [FutureWise report HC-1137]. Yet little mathematical attention has been paid to the complex network properties of genealogical networks, especially at large scales. The structure of genealogical networks is of particular interest due to the practice of forming unions, e.g. marriages, that are typically well outside one’s immediate family. In most other networks, including other social networks, no equivalent restriction exists on the distance at which relationships form. To study the effect this has on genealogical networks we use persistent homology to identify and compare the structure of 101 genealogical and 31 other social networks. Specifically, we introduce the notion of a network’s persistence curve, which encodes the network’s set of persistence intervals. We find that the persistence curves of genealogical networks have a distinct structure when compared to other social networks. This difference in structure also extends to subnetworks of genealogical and social networks suggesting that, even with incomplete data, persistent homology can be used to meaningfully analyze genealogical networks. Here we also describe how concepts from genealogical networks, such as common ancestor cycles, are represented using persistent homology. We expect that persistent homology tools will become increasingly important in genealogical exploration as popular interest in ancestry research continues to expand.


Introduction
The study of genealogical networks, that is, networks relating parents with children and spouses with each other through successive generations, is of rapidly growing interest, both because of genealogy's popular appeal and its applications in genetics (Kaplanis et al. 2018), sociology (Hamberger et al. 2011), population sciences (Rohde et al. 2004), and economics (Greenwood et al. 2014). The growing availability of rich, temporally resolved data is also driving interest in genealogy. For example, FamilySearch has constructed a human family tree with over 1.40 billion individuals, based on 2.21 billion sources, including 4.78 billion images (https://www.familysearch.org/en/newsroom/ [...]
[...] (2012), structural voids arise when several groups of neurons are strongly connected sequentially, but out-of-sequence pairs are only weakly connected. In these neurological networks, persistent homology provides a way to identify and classify these different sequences as well as quantify the strength of these connections. The application in Duman and Pirim (2018) provides a method for extending traditional genetic analysis tools to a parameterized family of datasets by constructing an appropriate topological object. Lastly, Mattia et al. (2016) show that structural voids or gaps can also represent much more abstract concepts. In this case persistent voids are shown to correspond to the atonality in musical compositions.
Intuitively, the voids or gaps in genealogical networks should be quite different when compared with other networks, such as social networks, since unions (such as marriages) in genealogical networks typically form at specific distances, rather than through other mechanisms, e.g. triadic closure. That is, distances between individuals who form unions are typically not too small or too large (see "Background: genealogical and social networks"). In contrast, in other social networks, new connections can form at any distance, but these distances are often quite small (Sintos and Tsaparas 2014). This difference in network growth between genealogical and other social networks causes differences in network topology that are reflected in the network's persistent homology. Thus persistent homology is a useful descriptive tool for exploring and modeling the structure of genealogical networks.
Here, we propose a new method for representing persistent homology, which we call a persistence curve (see "Comparing networks using persistent homology"). The persistence curves of many genealogical networks are very similar to each other, and importantly the persistence curves of subsets of genealogical networks, that is, sampled genealogical networks, are also similar to the persistence curves of unsampled genealogical networks (see "Results").
To give our study of genealogical networks context we also study the persistent homology of social networks. We find that the same result holds for the social networks we consider, in that the persistence curves of social networks show a common pattern and the persistence curves for social and sampled social networks are similar (see "Results"). We confirm our analysis using another tool for comparing persistent homologies, the bottleneck distance, which is also capable of detecting and differentiating the distinct homology patterns between genealogical and other social networks.
In summary, we make the following contributions:
• Introduce the notion of a persistence curve and introduce the use of this together with the bottleneck distance as a tool for the analysis of general networks.
• Report the distinct persistent homology structure of genealogical networks using both persistence curves and the bottleneck distance.
• Link this structure to genealogically relevant concepts.
• Similarly, report the distinct persistent homology structure of social networks and compare this to the structure of genealogical networks.
• Report evidence that persistent homology methods work well even in the presence of incomplete data. This is particularly relevant given that genealogical data is often, if not necessarily, incomplete.
Throughout the paper, examples from family networks are contrasted with other social networks to highlight the unique features of genealogical networks from a persistent homology point of view. The paper is organized as follows. In "Background: genealogical and social networks" we describe both genealogical and social networks. In "Persistent homology of networks" we define the persistent homology of a network and introduce the notion of persistence curves. In "Comparing networks using persistent homology" we define the bottleneck distance and show how both this distance and persistence curves can be used to compare networks. In "Results" we describe the genealogical and social data sets we use in our study and give our experimental results. "Results" also includes a discussion of how certain structural features of social and genealogical networks are represented using persistent homology. In "Conclusion" we summarize our results and conclude with a discussion regarding the use of persistent homology as a tool for analyzing general network structure and recovering network features. Throughout we give examples of each of the concepts we introduce.

Background: genealogical and social networks
We represent genealogical networks with a graph $G = (V, E)$, where $V = \{1, 2, \ldots, n\}$ are the individuals within the network, and $E$ are the (genealogical) relationships. These relationships consist of both parent-child edges and spouse (or more generally union) edges. For the sake of simplicity, these edges are considered to be undirected.
We note that the structure of a genealogical network is often thought of as being "treelike", since genealogical networks are often constructed from an individual, their parents, their grandparents, and so on, ignoring union edges. The result is a tree, i.e. a connected acyclic graph, if we create only a few generations of the family. However, full genealogical networks are not trees due to the presence, for example, of triangles consisting of two parents and a child (with the two parent-child edges and one union edge). Because of the frequency of such cycles and the fact that they are the smallest possible cycles, we refer to them as trivial cycles. The other typical familial cycle, or cycle found within a family consisting of two parents and some number of children, is a cycle of length four consisting of two parents and two children.
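As a minimal illustration of the familial cycles just described, the following Python sketch stores a small family as undirected union and parent-child edges and counts its trivial cycles (triangles). The individuals and the helper names `build_adjacency` and `count_triangles` are hypothetical, chosen only for this example; they are not taken from the data sets studied here.

```python
from itertools import combinations

# Hypothetical family: parents 1 and 2 with children 3 and 4.
union_edges = [(1, 2)]
parent_child_edges = [(1, 3), (2, 3), (1, 4), (2, 4)]

def build_adjacency(edges):
    """Undirected adjacency sets from a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def count_triangles(adj):
    """Count vertex triples in which all three edges are present."""
    return sum(
        1
        for a, b, c in combinations(sorted(adj), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    )

adj = build_adjacency(union_edges + parent_child_edges)
triangles = count_triangles(adj)  # one triangle per child: {1,2,3} and {1,2,4}
```

In this toy family the two children also give the length-four familial cycle 1, 3, 2, 4, 1, which uses only parent-child edges.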
Although familial cycles are ubiquitous in genealogical networks, they are not the only cycles that can form. Going far enough through an individual's ancestors, it is often possible to find a nearest common ancestor, i.e., a common ancestor of one's father and mother. If such an ancestor exists (and it usually does exist), then the genealogical network has a nontrivial cycle. We refer to this as a common ancestor cycle, which consists of only parent-child edges. Other nontrivial cycles are possible in genealogical networks via unions. For instance, a "double cousins" relationship occurs when two siblings from one family form unions with two siblings from another family. The result is a union cycle, or a cycle that contains only union edges and the parent-child edges connecting siblings.
In genealogical networks, union and parent-child edges can combine in any number of ways to create complex non-tree structures (see Fig. 1, left).
A feature that is particular to genealogical networks is that union edges typically form at specific distances within these networks. Here the distance $d(i, j)$ between $i$ and $j$ is the shortest path distance between these individuals if such a path exists. Otherwise, it is infinite. In a genealogical network we refer to the distance between two individuals before they form a union as the couple's distance to union. For cultural, genetic, and other reasons these distances are typically not small, i.e. usually larger than four. Consequently, genealogical networks do not typically have small nonfamilial cycles and often have large extended cycles. This is illustrated in Fig. 2, where distance to union data is collected from 104 publicly available genealogical networks given in Table 2 in the Appendix. Here familial cycles are omitted and the height of each bar represents the fraction of unions that form at a specific distance. Noticeably, few unions form at distances less than five, with the large majority of distances falling between 5 and 10.
Fig. 2 The histogram representing the finite "distance to union" distances is shown where data is collected from 104 genealogical networks from kinsources.net. The height of each bar represents the fraction of unions that form at a specific distance
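The distance to union can be sketched in a few lines of Python. The example below is a hedged illustration rather than the procedure used to produce Fig. 2: it assumes that "before they form a union" can be approximated by ignoring the couple's own union edge during a breadth-first search, and it does not remove paths through the couple's children. The family and the function name `distance_to_union` are hypothetical.

```python
from collections import deque

def distance_to_union(adj, a, b):
    """BFS distance from a to b that ignores the edge {a, b} itself.

    Returns float('inf') when no other path exists.
    """
    skipped = frozenset((a, b))
    dist = {a: 0}
    queue = deque([a])
    while queue:
        u = queue.popleft()
        if u == b:
            return dist[u]
        for w in adj[u]:
            if frozenset((u, w)) == skipped:
                continue  # assumption: only the union edge is removed
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return float("inf")

# Hypothetical network: grandparents 1, 2; their children 3, 4 with spouses 5, 6;
# grandchildren 7 (of 3 and 5) and 8 (of 4 and 6), who form a union.
edges = [(1, 2), (3, 5), (4, 6), (7, 8),          # union edges
         (1, 3), (2, 3), (1, 4), (2, 4),          # parent-child edges
         (3, 7), (5, 7), (4, 8), (6, 8)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

cousins = distance_to_union(adj, 7, 8)    # path 7, 3, 1, 4, 8
unrelated = distance_to_union(adj, 3, 5)  # no path once the union edge and child path differ
```

Note that `unrelated` is finite here only because the couple 3 and 5 share child 7; a fuller treatment would also exclude such familial paths, as the text does when familial cycles are omitted.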
The observation that genealogical networks have large extended cycles is illustrated in Fig. 3. Shown left in orange is the distribution of cycle lengths of the San Marino genealogical network, a network of the population of the Republic of San Marino from the 15th to the end of the 19th century (https://www.kinsources.net/browser/datasets.xhtml). In this network, which consists of 28,586 individuals, there are 7,146 familial cycles of length three and 8,636 familial cycles of length four. These are omitted in the figure so we can observe the lengths of the cycles forming a basis of nonfamilial cycles in the network. For the sake of contrast, in blue is the distribution of cycle lengths in a basis of the cycles found in the Deezer Europe social network, consisting of 28,281 individuals. Here, similar to genealogical networks, a social network is represented by a graph $G = (V, E)$ where the vertices $V$ also represent individuals. The difference is that in a social network the edges represent some type of social interaction(s). Deezer is an online music streaming platform; its social network represents individuals in Europe who use the platform, with edges representing mutual user-follower relationships.
Noticeably, the San Marino network has relatively few nonfamilial basis cycles under length ten but quite a few cycles with lengths greater than thirty. In contrast, the Deezer social network has a much tighter distribution of basis cycles ranging from roughly five to fifteen in length.
To understand the extent to which these cycle distributions are related to the local structure of the associated networks, we compare them to the cycle distributions of the configuration models associated with these two networks. The configuration model is a model for generating random networks with a given degree sequence (Newman 2006). Taking the degree sequences from both the San Marino genealogical and Deezer social network, we create ten versions of these networks each with the same degree sequences. The result of averaging the basis cycle length distributions of these versions of the San Marino and Deezer networks is shown in Fig. 3 (center and right, in red and green, respectively).
Fig. 3 Left: Shown in orange is the distribution of the lengths of the cycles forming a basis of the nonfamilial cycle lengths in the San Marino (SM) genealogical network. The analogous distribution of cycle lengths is shown in blue for all cycles in the Deezer Europe (DE) social network. Center: Shown in orange is again the basis cycle length distribution of the San Marino genealogical network. In red is the distribution of the basis cycle lengths averaged over ten realizations of the (loopy, multi-edged) configuration model on the San Marino network. Since the configuration model generates graphs with the same degree distribution as the SM network, this panel indicates that SM's longer cycles do not arise simply from the degree distribution. Right: Shown in blue is again the basis cycle length distribution of the Deezer social network. In green is the distribution of the basis cycle lengths averaged over ten realizations of the configuration model on the Deezer social network. For this social network, the cycle length distribution can be mostly explained by the degree distribution alone
While the cycle distribution for the San Marino network is quite different from what the configuration model produces, the cycle distribution for the Deezer social network is quite similar to the distribution predicted by its configuration model. This suggests that much of the cycle structure in the Deezer social network is dominated by local interactions, whereas the cycles in the San Marino genealogical network are affected by nonlocal mechanisms that form the network. This includes, presumably, the nonlocal distance to union phenomenon described above.
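For readers unfamiliar with the configuration model, the sketch below shows one common way to sample it: random matching of edge "stubs", which permits self-loops and multi-edges (the "loopy, multi-edged" variant). This is an illustrative Python implementation under that assumption, not the code used for Fig. 3; the function names and the small degree sequence are hypothetical, and computing a cycle basis for each realization is left out.

```python
import random

def configuration_model(degrees, seed=None):
    """Sample a loopy multigraph with the given degree sequence by stub matching."""
    rng = random.Random(seed)
    # Vertex v contributes degrees[v] stubs (half-edges).
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    if len(stubs) % 2:
        raise ValueError("the degree sum must be even")
    rng.shuffle(stubs)
    # Pair consecutive stubs in the shuffled list to form edges.
    return list(zip(stubs[::2], stubs[1::2]))

def degree_sequence(n, edges):
    """Degrees of vertices 0..n-1; a self-loop adds two to its vertex."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

degrees = [3, 2, 2, 1]  # hypothetical degree sequence
realizations = [configuration_model(degrees, seed=s) for s in range(10)]
```

Every realization has the same degree sequence as the input, so any difference between its cycle lengths and those of the original network cannot be attributed to degrees alone.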
The relations we see in Fig. 3 between the cycle length distribution for the San Marino genealogical network and the Deezer social network are typical of the genealogical and social networks we consider in "Data". This suggests that cycle length distribution is a feature that can be used to distinguish genealogical from social networks. Specifically, when we consider two networks with a similar number of cycles, genealogical networks have a much wider distribution of cycle lengths than social networks. However, the method used to calculate the cycle length distribution in Fig. 3 does not provide any further insight into this phenomenon. This limitation motivates us to apply tools from persistent homology, which provide ways to describe and measure the relation between any two network cycles. The additional structure that can be obtained by these methods allows us to further distinguish the structure of genealogical and social networks (see "Network comparison using bottleneck distance") and to relate the structural differences demonstrated in Fig. 3 to mechanisms that produce genealogical and social networks, respectively (see "Connections").

Persistent homology of networks
Persistent homology provides a method for studying cycles in a network. For the purposes of this paper, a brief explanation of persistent homology will be given from the context of simplicial homology. For a more in-depth treatment of simplicial homology, see Chapter 2.1 of Hatcher (2002). For those readers who are either familiar with the basics of persistent homology or who wish to skip the following technical discussion it is possible to proceed to "Data" where we discuss the social and genealogical networks we analyze.
For a network given by a graph $G = (V, E)$ we define the distance matrix $D(G) = [d_{ij}]$ to have entries $d_{ij} = d(i, j)$, the length of the shortest path between individuals $i$ and $j$. For each value $\delta$ that appears in the distance matrix $D(G)$, we form a simplicial complex $G_\delta$ as follows. The set of 0-simplices is equivalent to the set of vertices of $G$, where each 0-simplex is identified with a single vertex. Since the distinction between 0-simplices and vertices is purely formal, we will use the terms 0-simplex and vertex interchangeably, and the 0-simplices will be indexed the same way as the vertices. The set of 1-simplices $E_\delta$ corresponds to the set of edges $\{i, j\}$ such that $d(i, j) \le \delta$, where the edge $\{i, j\}$ is identified with the 1-simplex formed by $i$ and $j$. Again the distinction here is unnecessary for our present discussion, so we will use the same notation for 1-simplices and edges. However, the simplicial complex $G_\delta$ may also contain objects that do not have equivalent representatives in the graph $G$, namely the $n$-simplices for $n \ge 2$. For each integer $n \ge 2$, the set of $n$-simplices in $G_\delta$ consists of all $n$-simplices $[a_0 a_1 \ldots a_n]$ whose edges $\{a_j, a_k\}$, $0 \le j < k \le n$, all belong to $E_\delta$.
In order to simplify our remaining definitions, we extend our definition of $G_\delta$ to include all non-negative integers. For $i \ge 0$, let $\delta_i$ be the greatest entry of $D(G)$ such that $\delta_i \le i$, and set $G_i = G_{\delta_i}$. This definition together with our construction of $G_\delta$ ensures the following three important properties are true for all $G_i$. [...]

Example 3.1
Consider the hexagonal network $G$ shown in Fig. 4, whose distance matrix is
$$D(G) = \begin{bmatrix} 0 & 1 & 2 & 3 & 2 & 1 \\ 1 & 0 & 1 & 2 & 3 & 2 \\ 2 & 1 & 0 & 1 & 2 & 3 \\ 3 & 2 & 1 & 0 & 1 & 2 \\ 2 & 3 & 2 & 1 & 0 & 1 \\ 1 & 2 & 3 & 2 & 1 & 0 \end{bmatrix}.$$
For the values $i = 0, 1, 2, 3$, we form four simplicial complexes, $G_0$, $G_1$, $G_2$, and $G_3$. For $i = 0$ the set $E_0$ is empty, since no two distinct vertices are at distance 0. Thus, $G_0$ consists of six vertices. For $i = 1$ the set $E_1$ contains the six edges that form the network's single cycle, so $G_1 = G$. This graph has no trivial cycles (i.e., triangles), so $G_1$ contains no simplices of dimension greater than 1 (i.e., no $n$-simplices for $n > 1$). For $i = 2$ the set $E_2$ gains six additional edges. We also now have eight trivial cycles. Each of these cycles is the boundary of a 2-simplex, so $G_2$ contains these eight 2-simplices as well. However, no subset of these 2-simplices forms the boundary of a 3-simplex, so $G_2$ has no simplices of dimension greater than 2. For $i = 3$ the set $E_3$ contains all possible edges between the vertices of $G$, so all possible trivial cycles are present. Additionally, all possible 2-simplices, and hence all possible $n$-simplices, are also present in $G_3$. In particular, $G_3$ is a 5-simplex (the full simplex on six vertices) with its boundary. Since $M = 3$ is the largest value in the distance matrix, $G_i = G_3$ for all integers $i > 3$.
Fig. 4 The hexagonal network $G = G_1$ in Example 3.1 is filled in as $i$ increases from 0 to 3. This produces the simplicial complexes $G_0$, $G_1$, $G_2$, $G_3$ shown left to right

The persistent homology of the network $G$ measures how the homology of $G_i$ changes as $i$ increases. If certain features can be identified across multiple values of $i$, we say they persist. Intuitively, features that arise from the actual network structure should persist for many values of $i$, while features that arise because of measurement error, i.e. 'noise', should only appear sporadically. The Stability Theorem (the main theorem of Cohen-Steiner et al. (2007)) states that if the error in measuring a network is bounded by some constant $C$, then the persistent homology of the true network and the persistent homology of the noisy network will differ by at most $C$. We will make this statement more precise in "Persistence diagrams and bottleneck distance".
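The construction of the complexes $G_i$ for the hexagonal network can be checked with a short Python sketch. It computes the distance matrix by breadth-first search and counts the simplices of each dimension whose vertices are pairwise within distance $\delta$. The function names are hypothetical and the brute-force enumeration is only practical for very small networks.

```python
from collections import deque
from itertools import combinations

# Hexagonal network: vertex i is adjacent to its two neighbours on the 6-cycle.
hexagon = {i: {i % 6 + 1, (i - 2) % 6 + 1} for i in range(1, 7)}

def bfs_distances(adj, source):
    """Shortest-path distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def simplex_counts(adj, delta, max_dim=2):
    """Number of k-simplices (k = 0..max_dim) with all pairwise distances <= delta."""
    dist = {v: bfs_distances(adj, v) for v in adj}
    counts = []
    for size in range(1, max_dim + 2):
        counts.append(sum(
            1
            for s in combinations(sorted(adj), size)
            if all(dist[a].get(b, float("inf")) <= delta
                   for a, b in combinations(s, 2))
        ))
    return counts

first_row = [bfs_distances(hexagon, 1)[j] for j in range(1, 7)]
counts_by_delta = {d: simplex_counts(hexagon, d) for d in range(4)}
# counts_by_delta[2] gives 6 vertices, 12 edges and 8 triangles (the octahedron)
```

The counts for $\delta = 3$ (6 vertices, 15 edges, 20 triangles) are those of the full simplex on six vertices.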
Here we give a formal definition of persistent homology in terms of simplicial homology, which we immediately follow with equivalent definitions in the context of networks. We use $H_p(G_i)$ to denote the dimension-$p$ simplicial homology of the simplicial complex $G_i$ with coefficients in $\mathbb{Z}_2$, so that $H_p(G_i)$ is a vector space over $\mathbb{Z}_2$.

Definition 1 (pth Persistent Homology) For a graph G, and integers
Our analysis in the "Comparing networks using persistent homology" and "Results" sections only requires the first few dimensions of persistent homology to distinguish the genealogical and social networks we consider. In order to better understand what persistent homology calculates, in what follows we will provide equivalent definitions for $\mathrm{PH}_0$, $\mathrm{PH}_1$, and $\mathrm{PH}_2$ using network concepts. We also illustrate how these definitions apply to the hexagonal network in Fig. 4b. (See Examples 3.3, 3.4, and 3.5 for $\mathrm{PH}_0$, $\mathrm{PH}_1$, and $\mathrm{PH}_2$, respectively.)

Definition 2 (Births and Deaths) Let $G = (V, E)$ be a network with simplicial complexes $G_0, G_1, G_2, \ldots$. The $p$th persistent homology of $G$ provides maps $\varphi_{i,j}$ between the $p$th homology of $G_i$ and the $p$th homology of $G_j$. Suppose that basis elements have been

Remark 3.2
Those already familiar with persistent homology will find that the preceding definition is somewhat nonstandard, although it is equivalent to the standard definition. We have taken this approach to reduce the notation burden on non-specialist readers. We have done similarly with some of the other persistent homology definitions.
We will demonstrate how to choose such representatives for $H_0$, $H_1$, and $H_2$ in the following definitions. Given such representatives, though, the maps $\varphi_{i,j}$ and $\varphi_{j,k}$ are simply the maps on homology induced by the inclusion maps $G_i \subset G_j \subset G_k$. That is, if $a$ represents $\alpha \in H_p(G_i)$, then $a$ also represents $\varphi_{i,j}(\alpha)$. The Fundamental Theorem of Persistent Homology ensures that we can choose a single representative that corresponds to $\alpha \in H_p(G_i)$, $\varphi_{i,j}(\alpha) \in H_p(G_j)$, and $\varphi_{j,k}(\varphi_{i,j}(\alpha)) \in H_p(G_k)$. The birth of $\alpha$ is then just the first $G_i$ in which the representative exists, and the death of $\alpha$ is the first $G_k$ in which the representative is null-homotopic, i.e., homotopic to a trivial cycle.

Definition 3 (Representing Persistent Homology: Dimension 0) Let $G = (V, E)$ be a network with $n$ vertices and $k$ connected components. Then $H_0(G_0) \cong \mathbb{Z}_2^n$, so we can identify the basis for $H_0(G_0)$ with the set of all $n$ vertices. Likewise, we may choose $k$ vertices, one from each connected component, to represent the basis for $H_0(G_i) \cong \mathbb{Z}_2^k$ for $i \ge 1$. Thus, we will refer to the vertices of $G$ as representatives of $\mathrm{PH}_0(G)$. (In fact, $\mathrm{PH}_0(G)$ is a vector space whose basis elements are equivalence classes of formal sums of 0-simplices.)

Example 3.3
We now consider $\mathrm{PH}_0(G)$ for the hexagonal network $G$ in Fig. 4, with $G_0$, $G_1$, $G_2$, and $G_3$ in the same figure. Recall that $G$ has six distinct vertices forming one connected component. If we take any numbering of the vertices, then each vertex $v$ represents a basis element of $H_0(G_0)$ and first appears at $i = 0$; we call this the birth of $v$. At $i = 1$ the network is connected, so a single vertex, say 1, suffices to represent the basis of $H_0(G_1)$. Since we have removed all vertices except 1 from the basis, we say this is the death of those five 0-simplices. Since 1 will always be in the basis for $G_i$, the death of 1 is said to be $\infty$.
Definition 4 (Representing Persistent Homology: Dimension 1) Let $G = (V, E)$ be a network with one connected component. For each $i \ge 0$, we can identify the basis of $H_1(G_i)$ with a set $C_i$ of cycles in $G_i$. The Fundamental Theorem of Persistent Homology allows us to choose these cycles so that if $\sigma$ is a cycle in $C_i$, then exactly one of the following is true for any integer $j \ge 0$:
1. $\sigma$ does not exist in $G_j$, in which case $j < i$,
2. $\sigma$ is trivial in $G_j$, in which case $i < j$,
3. $\sigma$ is a cycle in $C_j$.
Thus, we will refer to the cycles in $\bigcup_{i \ge 0} C_i$ as the representatives of $\mathrm{PH}_1(G)$. (Again, $\mathrm{PH}_1(G)$ is actually much larger than this. These are actually representatives of equivalence classes that form a basis for $\mathrm{PH}_1(G)$ as a vector space.) We note that $C_0$ is always empty, since there are no edges in $G_0$. Furthermore, $\mathrm{rank}(H_1(G_i)) = |C_i|$ for all $i \ge 0$. Because of the construction of the $G_i$, all representatives of $\mathrm{PH}_1(G)$ will be present in $G_1$. One can think of the representatives of $\mathrm{PH}_1(G)$ as representing "large" cycles. More specifically, if a cycle $\sigma$ is contained in $\bigcap_{s \le i \le t} C_i$, then it must have a diameter of at least $t$ and at least one pair of consecutive vertices distance $s$ apart.

Example 3.4
We now consider $\mathrm{PH}_1(G)$ for the hexagonal network $G$ in Fig. 4b. In both Fig. 4a and 4b we see that $G_0$ has no cycles, $G_1$ has exactly one cycle, and that the cycle in $G_1$ is non-trivial. In Figs. 5a and 5b, we have indicated some of the cycles in $G_2$, namely the cycles 1,2,3,1; 3,4,5,3; 1,5,6,1; and 1,3,5,1 in Fig. 5a and the cycle 1,2,3,5,1 in Fig. 5b. In fact, Fig. 5c shows us that $G_2$ is an octahedron and therefore every cycle in $G_2$ is either trivial or null-homotopic. Finally, $G_3$ contains even more cycles than $G_2$, such as 1,3,6,1; but these are all null-homotopic since $G_3$ also contains every possible 2-simplex for six vertices. Therefore, $\mathrm{PH}_1(G)$ has only one representative, the cycle 1,2,3,4,5,6,1; which appears in $G_1$, so we say that $t = 1$ is the birth of the cycle. The cycle is null-homotopic in $G_2$, so $t = 2$ is the death of the cycle.
We now turn our attention to $\mathrm{PH}_2(G)$, but in order to represent $\mathrm{PH}_2(G)$ we need to introduce some new structure for the induced graphs. A triangle $[a\,b\,c]$ in $G_i$ is a set of three vertices, $a$, $b$, and $c$, that form a trivial cycle in $G_i$. That is, the edges $\{a, b\}$, $\{b, c\}$, and $\{a, c\}$ are all present in $G_i$. A closed surface in $G_i$ is a set of distinct triangles so that for each $[a\,b\,c]$ in the set there is exactly one other triangle $[a\,b\,d]$ also in the set. A closed surface in $G_i$ is trivial if the corresponding set of 2-simplices is null-homotopic in $G_i$. That is, the closed surface is "filled in" by some collection of 3-simplices in $G_i$. For example, the octahedron in Fig. 5c is a non-trivial closed surface in $G_2$ because there are no 3-simplices in $G_2$. In $G_3$, however, we add edges between vertices at distance 3. In turn, we gain several 3-simplices, including $[1\,2\,3\,6]$, $[1\,3\,5\,6]$, $[3\,4\,5\,6]$, and $[2\,3\,4\,6]$. Figure 5d shows three of these 3-simplices to demonstrate how the closed surface from $G_2$ is filled in by all four.
Definition 5 (Representing Persistent Homology: Dimension 2) Let $G = (V, E)$ be a network with one connected component. For each $i \ge 0$, we can identify the basis for $H_2(G_i)$ with a set $S_i$ of non-trivial closed surfaces in $G_i$. The Fundamental Theorem of Persistent Homology allows us to choose these representatives so that if $\sigma$ is a closed surface in $S_i$, then exactly one of the following is true for any integer $j \ge 0$:
1. $\sigma$ does not exist in $G_j$, in which case $j < i$,
2. $\sigma$ is trivial in $G_j$, in which case $i < j$,
3. $\sigma$ is a closed surface in $S_j$.
Thus we will refer to the closed surfaces in $\bigcup_{i \ge 0} S_i$ as the representatives of $\mathrm{PH}_2(G)$.
The geometric intuition for $\mathrm{PH}_2(G)$ is similar to that of $\mathrm{PH}_1(G)$ in identifying large 'voids' in $G$. If $\sigma \in \bigcap_{s \le i \le t} S_i$, then $\sigma$ is a closed surface with diameter at least $t$. The value of $s$ is harder to describe, but is related to the density of vertices.

Example 3.5
We now consider $\mathrm{PH}_2(G)$ for the hexagonal graph $G$ in Example 3.1. Recall from Example 3.4 that $G_0$ and $G_1$ have no trivial cycles, and therefore contain no closed surfaces. We can see in Fig. 5 that $G_2$ has exactly one closed surface and it must be non-trivial, since there are no 3-simplices. Finally, $G_3$ has many closed surfaces, but because it contains every possible 3-simplex on six vertices, these are all trivial. Therefore, $\mathrm{PH}_2(G)$ has only one representative, the octahedral closed surface in $G_2$. This surface first appears in $G_2$, so $t = 2$ is its birth, and the surface is filled by a solid in $G_3$, so $t = 3$ is its death.
Definition 6 (Persistence Intervals) Recall that the birth of a representative $\sigma \in \mathrm{PH}_p(G)$ (vertex, cycle, or closed surface) of the persistent homology of a network $G$ is the smallest integer $i$ so that $\sigma \in G_i$, and the death of $\sigma$ is the largest integer $j$ so that $\sigma \in G_{j-1}$ and $\sigma$ is trivial in $G_k$ for $k \ge j$, if such an integer exists. The persistence interval for $\sigma$ is $[a, b)$, where $a$ and $b$ are the birth and death of $\sigma$, respectively. This represents the set of all parameter values $i$ for which the equivalence class corresponding to $\sigma$ is a non-trivial element of $H_p(G_i)$. The persistence of $\sigma$ is $b - a$.

Example 3.6
We now finish our consideration of the persistent homology of $G$ from Fig. 4b. Recall from Example 3.3 that $\mathrm{PH}_0(G)$ has six representatives. These all have birth $t = 0$. Five of these have a death of $t = 1$, and one of these has a death of $\infty$. Therefore the persistence intervals for $\mathrm{PH}_0(G)$ are $[0, 1) \times 5$ and $[0, \infty) \times 1$.
From Example 3.4, we know $\mathrm{PH}_1(G)$ has one representative, with birth $t = 1$ and death $t = 2$. Therefore the corresponding persistence interval is $[1, 2)$. Note that the diameter of the cycle is 3 and every pair of consecutive vertices is distance 1 apart. This follows the idea mentioned earlier that the representatives of $\mathrm{PH}_1(G)$ indicate 'large' cycles. Specifically, the diameter of $\sigma$ is at least the death of $\sigma$, and the birth of $\sigma$ is the maximum distance between consecutive vertices.
From Example 3.5, $\mathrm{PH}_2(G)$ has one representative, with birth $t = 2$ and death $t = 3$. Therefore, the persistence interval for that element is $[2, 3)$. Note that the diameter of the corresponding set of vertices is 3 in $G$. This also follows the idea mentioned earlier that $\mathrm{PH}_2(G)$ identifies large 'voids' in $G$. Specifically, the death of $\sigma$ is a lower bound on the diameter of $\sigma$.
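The intervals of Example 3.6 can be reproduced numerically. The Python sketch below is an assumption-laden illustration, not the software used in this paper: it builds every simplex up to dimension 3 on the hexagon, assigns each the largest pairwise distance among its vertices as its filtration value, and applies the standard boundary-matrix column reduction over $\mathbb{Z}_2$. All function and variable names are hypothetical, and the brute-force enumeration only suits very small networks.

```python
from collections import Counter, deque
from itertools import combinations

INF = float("inf")

def all_pairs_distances(adj):
    """Shortest-path distances between all vertex pairs via breadth-first search."""
    dist = {}
    for source in adj:
        d = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in d:
                    d[w] = d[u] + 1
                    queue.append(w)
        dist[source] = d
    return dist

def persistence_intervals(adj, max_dim=2):
    """Return a Counter mapping (dimension, birth, death) to its multiplicity."""
    dist = all_pairs_distances(adj)
    simplices = []
    # Simplices of dimension max_dim + 1 are needed to fill in max_dim classes.
    for size in range(1, max_dim + 3):
        for s in combinations(sorted(adj), size):
            value = max((dist[a].get(b, INF) for a, b in combinations(s, 2)), default=0)
            if value < INF:
                simplices.append((value, size - 1, s))
    simplices.sort()  # by value, then dimension: faces precede cofaces
    index = {s: k for k, (_, _, s) in enumerate(simplices)}

    reduced = {}    # pivot (largest row index) -> reduced boundary column
    killed = set()  # simplices whose homology class is later filled in
    creators = []   # simplices that create a new homology class
    intervals = Counter()
    for k, (value, dim, s) in enumerate(simplices):
        column = {index[s[:i] + s[i + 1:]] for i in range(len(s))} if dim > 0 else set()
        while column and max(column) in reduced:
            column ^= reduced[max(column)]  # column addition over Z_2
        if not column:
            creators.append(k)
            continue
        pivot = max(column)
        reduced[pivot] = column
        killed.add(pivot)
        birth, birth_dim, _ = simplices[pivot]
        if birth < value:  # zero-length intervals are discarded
            intervals[(birth_dim, birth, value)] += 1
    for k in creators:
        value, dim, _ = simplices[k]
        if dim <= max_dim and k not in killed:
            intervals[(dim, value, INF)] += 1
    return intervals

hexagon = {i: {i % 6 + 1, (i - 2) % 6 + 1} for i in range(1, 7)}
hexagon_intervals = persistence_intervals(hexagon)
```

Keys of the returned counter are triples (dimension, birth, death), so the interval $[1, 2)$ of $\mathrm{PH}_1(G)$ appears as `(1, 1, 2)`. For networks of realistic size a dedicated persistent homology library would be needed instead of this enumeration.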
Given the representatives chosen in Definitions 3, 4, 5, and 6, we have the following three observations regarding the persistent homology of a finite, undirected, unweighted graph $G$: (i) If $G$ has $n$ vertices, then $\mathrm{PH}_0(G)$ will have exactly $n$ persistence intervals, with exactly one $[0, \infty)$ interval for each connected component and the rest will be $[0, 1)$ intervals. (ii) In dimension 1, $\mathrm{PH}_1(G)$ describes the number and sizes of the non-trivial cycles in the original network. The persistence intervals will all be of the form $[1, b)$ for some integer $b > 1$. The value of $b$ is related to the diameter of the corresponding cycle. In the networks we have studied, we note that a persistence interval $[1, b)$ in $\mathrm{PH}_1(G)$ corresponds to a simple cycle with between $3b - 2$ and $3b$ vertices, inclusive. (iii) In dimension 2, the voids we detect in $\mathrm{PH}_2(G)$ tell us about the nontrivial intersections of cycles. Such intersections are hard to visualize but, roughly speaking, a representative in $\mathrm{PH}_2(G)$ can only form if several large cycles intersect each other pairwise.
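Observation (i) means that the dimension-0 intervals can be read off from a component count alone. A minimal Python sketch, using a union-find structure and a hypothetical eight-vertex example (a hexagon plus one separate edge), is:

```python
def ph0_intervals(vertices, edges):
    """Dimension-0 persistence intervals of an unweighted graph, per observation (i)."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent pointers to the component root, compressing the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)  # merge the two components
    components = len({find(v) for v in parent})
    # One [0, inf) interval per component; every other vertex dies at 1.
    return {(0, 1): len(parent) - components, (0, float("inf")): components}

# Hypothetical network: the hexagon on vertices 1..6 plus the separate edge {7, 8}.
hexagon_edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]
example = ph0_intervals(range(1, 9), hexagon_edges + [(7, 8)])
```

Here `example` records six $[0, 1)$ intervals and two $[0, \infty)$ intervals, one for each connected component.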

Comparing networks using persistent homology
In this section we demonstrate how methods based on persistent homology can be used to compare different networks. The two methods we introduce in this paper are based on using (a) the bottleneck distance and (b) the persistence curves of a given set of networks. Both (a) and (b) rely on first computing persistence intervals and then analyzing the differences in these intervals. The two networks we consider throughout this section to demonstrate these methods are the Tikopia genealogical network from Fig. 1 (left) and the hexagonal network from Fig. 4. The persistence intervals for these networks are given in Table 1.
Table 1 The persistence intervals of the Tikopia genealogical network and the hexagon network are shown. Here the notation $[a, b) \times k$ indicates that the network has $k$ persistence intervals $[a, b)$. The corresponding persistence diagrams are shown in Fig. 6 and the corresponding persistence curve for the Tikopia network is shown in Fig. 7. (Table columns: Dimension; Interval Type and Persistence.)

Persistence diagrams and bottleneck distance
One common way to represent persistence intervals is to plot them as points in R × (R ∪ {∞}) , which is typically referred to as a persistence diagram. While this method of visualizing a network's persistent homology does not indicate how often a given persistence interval occurs, it does provide information on what kind of persistence intervals occur for a given network.
Definition 7 (Persistence Diagrams) Let $\mathrm{PH}_p(G)$ be the $p$th persistent homology of a network $G$. The persistence diagram for $\mathrm{PH}_p(G)$ is a multiset of points in $\mathbb{R} \times (\mathbb{R} \cup \{\infty\})$ defined as follows.
• For each $\sigma \in \mathrm{PH}_p(G)$ with persistence interval $[a, b)$, we include one copy of the point $(a, b)$.
• For each $c \in \mathbb{R}$, we include infinitely many copies of the point $(c, c)$.
Note that we include the points (a, a) to represent features in G that are considered trivial in PH p (G) , such as cycles consisting of exactly three vertices. This inclusion is necessary for us to define a meaningful metric on the space of persistence diagrams. The metric we use here is called the bottleneck distance.
Definition 8 (Bottleneck Distance) Let S_1 and S_2 be persistence diagrams for two graphs G and H, respectively. Let η range over the set of bijections from S_1 to S_2. Then the bottleneck distance between S_1 and S_2 is

$$d_B(S_1, S_2) = \inf_{\eta} \sup_{x \in S_1} \lVert x - \eta(x) \rVert_{\infty}.$$

The Fundamental Theorem of Persistent Homology (introduced in Zomorodian and Carlsson (2005), explained well in Otter et al. (2017) and Aktas et al. (2019)) ensures that if two graphs are isomorphic, the corresponding persistence diagrams will be equal, and thus the bottleneck distance will be 0. However, it is possible for non-isomorphic graphs to have identical persistence diagrams.

Example 4.1 (Bottleneck Distance Between the Tikopia and Hexagonal Networks) Notice that the persistence intervals for the Tikopia genealogical network (see Table 1) include, as a subset, the persistence intervals from the hexagonal network we considered in Example 3.6. We can form a bijection between the persistence diagrams of the Tikopia and hexagonal networks by identifying the non-trivial intervals from the hexagonal network with those of the Tikopia network. We then map any additional intervals from the Tikopia network of the form [a, b) to the trivial interval [(a+b)/2, (a+b)/2). (The perceptive reader may notice that this is not clearly a bijection, but there is a standard technique from set theory for modifying it to be bijective.)
This mapping is shown in Fig. 6 (right). Here, [1, 7) is mapped to [4, 4). As this pair of points is further apart than any other pair in this bijection, the bottleneck distance for the two networks is at most three, since we take an infimum over all possible bijections. Conversely, there is no interval in the hexagonal persistence diagram that is closer to [1, 7) than 3, so the bottleneck distance is at least three. Thus, the bottleneck distance for these two persistence diagrams is exactly 3.
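To make the matching argument in Example 4.1 concrete, the following is a minimal brute-force sketch of the bottleneck distance for very small diagrams with finite intervals only. It is not the method used to produce the results in this paper. The two toy diagrams are hypothetical stand-ins (they are not the contents of Table 1), and the function names are ours. Instead of infinitely many diagonal points, each diagram is padded with just enough diagonal slots to allow any point to be matched to the diagonal.

```python
from itertools import permutations


def linf(p, q):
    """L-infinity distance between two points of a persistence diagram."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def diag_cost(p):
    """L-infinity distance from (a, b) to its nearest diagonal point ((a+b)/2, (a+b)/2)."""
    return (p[1] - p[0]) / 2


def bottleneck(d1, d2):
    """Brute-force bottleneck distance; only practical for a handful of points."""
    left = list(d1) + [None] * len(d2)    # None marks a diagonal slot
    right = list(d2) + [None] * len(d1)

    def cost(p, q):
        if p is None and q is None:
            return 0.0
        if p is None:
            return diag_cost(q)
        if q is None:
            return diag_cost(p)
        return linf(p, q)

    best = float("inf")
    for perm in permutations(range(len(right))):
        worst = max((cost(left[i], right[j]) for i, j in enumerate(perm)), default=0.0)
        best = min(best, worst)
    return best


small = [(1, 2)]             # hypothetical diagram with one short cycle
large = [(1, 2), (1, 7)]     # same, plus one long cycle with interval [1, 7)
print(bottleneck(small, large))  # 3.0
```

In this toy case the unmatched point (1, 7) is sent to (4, 4) on the diagonal at cost 3, mirroring the reasoning in the example. The factorial cost of enumerating all matchings means this approach does not scale beyond a few points.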
Suppose that two networks, each of which is connected, admit isometric embeddings in R n . The Stability Theorem (Cohen-Steiner et al. 2007) guarantees that if the Hausdorff distance between the embeddings is δ, then the bottleneck distance for the corresponding persistence diagrams is at most δ. For example, if the PH 1 persistence diagrams differ by δ, then for any isometric embedding, any attempt to pair up cycles in the networks must include at least one pair of cycles that are at least δ apart in that embedding. In "Network comparison using bottleneck distance" we apply this idea to a large collection of genealogical and social networks.

Persistence curves
For the network data we consider, persistence diagrams obfuscate a key difference that we consider important: the number of persistence intervals. For a simple example of this, consider networks of the form V = {1, 2, . . . , n} with edges of the form {i, i + 1} for 1 ≤ i < n . For n ≥ 2 , any network of this type will have persistence intervals [0, 1) × (n − 1) and [0, ∞) × 1 . However, when plotting the persistence diagram we will only 'see' two points: (0, 1) and (0, ∞).
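The loss of multiplicity described above can be checked directly. This short sketch uses the path-network intervals stated in the text; the function name is ours. It compares the set of plotted points with a multiset count for two different values of n.

```python
from collections import Counter


def path_graph_ph0(n):
    """PH_0 intervals of the path on n >= 2 vertices, as stated in the text."""
    return [(0, 1)] * (n - 1) + [(0, float("inf"))]


for n in (3, 30):
    intervals = path_graph_ph0(n)
    plotted_points = set(intervals)       # what the persistence diagram shows
    multiplicities = Counter(intervals)   # what the diagram does not show
    print(n, sorted(plotted_points), dict(multiplicities))
```

Both paths yield the same two plotted points, while the counts (2 versus 29 copies of [0, 1)) differ.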
To address this limitation, we introduce the notion of a persistence curve as a new way to visualize the persistent homology of a network (see Definition 9). The difference between the persistence curve and the persistence diagram of a network is that the persistence curve also includes the number of intervals of a particular type. To create a persistence curve we first compute a network's persistence intervals, then sort the intervals of a given dimension by their persistence into a bar graph. For instance, in dimension 1 the Tikopia genealogical network has thirteen [1, 2) intervals, nineteen [1, 3) intervals, etc., which are sequentially stacked as shown in Fig. 7 (left) to create what we will call a barcode. To create the associated persistence curve we connect the endpoints of each subsequent bar as shown in Fig. 7 (right).
Fig. 6 Left: The persistence diagram of the hexagonal network in Fig. 4b is shown. Center: The persistence diagram of the Tikopia genealogical network in Fig. 1 (left) is shown. Right: A bottleneck bijection between the persistence intervals of the hexagonal and Tikopia family network is shown. Orange lines show which points are matched to points of the form (a, a) where a ∈ R
In dimension 1, the birth times of our intervals will all start at 1, as the networks we consider are unweighted, undirected, and connected. This means that in this dimension the resulting bar graph is also a plot of the death times for each interval. For higher dimensions, which have varied birth times, we also plot the lengths of the intervals but for simplicity we start at 1 as in dimension 1.
A formal definition of a network's persistence curves is the following.
Definition 9 (Persistence Curves) Let G = (V , E) be a network with nonempty vertex and edge sets. Let {[a j , b j )} be the set of all persistence intervals for each σ j ∈ PH n (G) where j ∈ N . For all n ∈ N the persistence curve PH n (G) is the linear interpolation of the set of points given by the endpoints of the bars obtained when these intervals are sorted by persistence and stacked sequentially, as described above.
Visualizing persistence intervals as a curve allows us to compare the persistent homology of different networks in a similar fashion to persistence diagrams while retaining different information. In particular, we can see how many intervals there are of a given persistence, whereas the persistence diagram only indicates the presence of such an interval. In what follows we will typically plot the persistence curves of multiple networks on the same axes to indicate what differences exist in the persistent homology of different networks (cf. "Results").
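The construction of a persistence curve from a list of intervals can be sketched as follows. Several choices here are our assumptions rather than details confirmed by the text: bars are sorted in ascending order of persistence, each bar is drawn starting at 1, the horizontal coordinate is the bar's right endpoint, and the vertical coordinate is the bar's position in the stack. Only finite intervals are handled, and the example uses just the two interval counts quoted for the Tikopia network in dimension 1, omitting the rest.

```python
def persistence_curve_points(intervals):
    """Right endpoints of sorted, stacked bars; joining them gives the curve.

    Assumptions (ours): ascending sort, every bar starts at 1, finite intervals.
    """
    lengths = sorted(b - a for a, b in intervals)
    return [(1 + length, j) for j, length in enumerate(lengths, start=1)]


# Partial dimension-1 data quoted in the text: thirteen [1, 2) and nineteen [1, 3).
partial_tikopia = [(1, 2)] * 13 + [(1, 3)] * 19
points = persistence_curve_points(partial_tikopia)
print(points[0], points[12], points[13], points[-1])
```

With this orientation, many intervals of equal persistence produce a vertical run of points, and a few long intervals produce a flatter profile.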

Data
The data we consider in this paper is of two types: genealogical network data and other social network data. The genealogical networks we consider are drawn from ninety-seven genealogical networks found in (https://www.kinsources.net/browser/datasets.xhtml), which range in size from n = 17 to 5,016 individuals. The social network data we use is taken from twenty-seven different social networks obtained from (http://snap.stanford.edu/data/index.html#socnets, http://networkrepository.com/soc.php). These range in size from n = 16 to 2,539 individuals. (See Table 2 in the Appendix for a full description of this data set.) Although many larger genealogical and social network data sets are available, we are limited by both the time and space complexity of the algorithm used to compute persistence intervals. The program we used, called Ripser (from the Python package Ripser) (Ripser 2021), has a time and space complexity of O((n + m)^3), where n is the number of individuals and m is the number of edges in a network. The number n + m is the number of simplices in the network. In the genealogical networks we consider there are between n + m = 41 and 15,735 simplices, and in the social networks we consider between n + m = 41 and 19,056 simplices.
Fig. 7 … Table 1. Right: The associated persistence curve for the Tikopia network in Fig. 1 is shown
To understand how a network's persistence intervals are affected by the completeness or incompleteness of data we also consider subnetworks sampled from a few, much larger, genealogical and social networks. These sampled networks are created by randomly selecting an individual with a single neighbor, i.e. a vertex of degree 1, then performing a breadth-first search starting with this individual to find the η closest individuals in the network to this individual. Because of the time and space limitations of Ripser we choose 600 ≤ η ≤ 3,000 to ensure we can compute the persistence intervals of these sampled networks. In total we sampled from four different genealogical networks and four different social networks. These are the Advogat, LastFM Asia, Deezer HU and Deezer RO social networks and the genealogical networks 96-99 shown in Table 2, respectively. We sampled from each of these networks five times to create a total of 20 sampled genealogical networks and 20 sampled social networks. The reason we begin our breadth-first search with a vertex of degree 1 is to ensure that our sampled networks have vertices both on the boundary and the interior of the original network we sampled, to better mimic the structure of the original genealogical and social networks.
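The sampling procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the code used for the paper: the adjacency-dictionary representation, the function name, and the tie-breaking order among equally distant vertices are all ours, and here the starting vertex is counted among the η sampled individuals.

```python
import random
from collections import deque


def sample_subnetwork(adj, eta, rng=random):
    """Breadth-first sample of eta vertices, starting from a random degree-1 vertex.

    adj maps each vertex to the set of its neighbours (hypothetical format).
    Returns the start vertex and the set of sampled vertices.
    """
    leaves = sorted(v for v, nbrs in adj.items() if len(nbrs) == 1)
    if not leaves:
        raise ValueError("network has no vertex of degree 1")
    start = rng.choice(leaves)
    seen = {start}
    queue = deque([start])
    while queue and len(seen) < eta:
        v = queue.popleft()
        for w in sorted(adj[v]):          # sorted only to make ties repeatable
            if w not in seen:
                seen.add(w)
                queue.append(w)
                if len(seen) == eta:
                    break
    return start, seen


# Toy network: a path on vertices 0..9, whose only degree-1 vertices are 0 and 9.
adj = {i: set() for i in range(10)}
for i in range(9):
    adj[i].add(i + 1)
    adj[i + 1].add(i)

start, sample = sample_subnetwork(adj, 4, random.Random(0))
print(start, sorted(sample))
```

Because breadth-first search visits vertices in order of distance from the start, the sample is always connected, which is why every sampled network has a single connected component.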
Apart from the (i) genealogical and social networks we consider and (ii) sampled versions of these networks, we also consider what we refer to as (iii) atypical genealogical networks. There are a number of genealogical networks that appear to be created with no attempt to represent all or even a fraction of the familial relationships. For example, the US Presidents network, cited as Atyp. Gen. Network 2 in Table 2, follows the shortest genealogical path between presidents, leaving out extraneous relationships. We consider a number of these atypical genealogical networks, which form a contrast to the more standard genealogical networks we consider, especially in terms of their persistent homology. A description of each of the (i) genealogical, social, (ii) sampled genealogical, sampled social, and (iii) atypical genealogical networks we consider is given at the end of the Appendix.

Results
Here we compare genealogical and other social networks using (a) the bottleneck distance and (b) the persistence curves defined in "Comparing networks using persistent homology" (see Definitions 8 and 9, respectively). For those who have skipped "Persistent homology of networks" and "Comparing networks using persistent homology", the bottleneck distance gives us a distance between two networks based on the differences in their persistent homology. Persistence curves give us a way of visualizing this difference but in greater detail (cf. Fig. 7).

Network comparison using bottleneck distance
Here we compute the bottleneck distance between every pair from the social and genealogical networks we consider. To visualize these results we use principal component analysis to identify the two components that account for the most variance and then plot this data in R 2 (see Fig. 8).
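One way to carry out the projection step is sketched below. The text does not say how the pairwise distances are fed into the principal component analysis, so this sketch assumes that each network is represented by its row of bottleneck distances to all other networks; that choice, the function name, the toy distance matrix, and the use of plain power iteration (rather than a library routine) are all ours.

```python
from math import sqrt


def pca_2d(rows, iters=100):
    """Project rows onto their top two principal components via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    X = [[r[k] - means[k] for k in range(d)] for r in rows]   # centred data

    def scatter_times(v):
        # Computes (X^T X) v without forming the d-by-d matrix explicitly.
        Xv = [sum(x[k] * v[k] for k in range(d)) for x in X]
        return [sum(X[i][k] * Xv[i] for i in range(n)) for k in range(d)]

    components = []
    for _ in range(2):
        v = [1.0 / (k + 1) for k in range(d)]   # fixed start, for repeatability
        for _ in range(iters):
            w = scatter_times(v)
            for c in components:                # stay orthogonal to earlier components
                dot = sum(a * b for a, b in zip(w, c))
                w = [a - dot * b for a, b in zip(w, c)]
            norm = sqrt(sum(a * a for a in w))
            if norm < 1e-12:
                break
            v = [a / norm for a in w]
        components.append(v)
    return [[sum(x[k] * c[k] for k in range(d)) for c in components] for x in X]


# Hypothetical pairwise distances: networks 0, 1 are close, as are networks 2, 3.
D = [[0, 1, 5, 5],
     [1, 0, 5, 5],
     [5, 5, 0, 1],
     [5, 5, 1, 0]]
coords = pca_2d(D)
for i, (x, y) in enumerate(coords):
    print(i, round(x, 3), round(y, 3))
```

In this toy case the first coordinate separates the two groups of networks. The sign of each component is arbitrary, so only relative positions are meaningful.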
From each part of Fig. 8 we can see that genealogical networks are generally separated from social networks and form clusters that are easily distinguished. For the sampled networks (shown left), we can easily separate genealogical and social networks, and we can identify at least two distinct subclasses of genealogical networks. However, the bottleneck distance does an inferior job of separating the non-sampled genealogical and social networks (shown center and right). The exceptions are the atypical genealogical networks, whose persistence intervals differ significantly enough from all of the other networks to be distinguishable as a third class of networks (shown center).

Comparison of genealogical and social networks using persistence curves
Persistence curves give us an alternative way of comparing networks. The advantage of using these curves compared to the bottleneck distance is that they give us a more detailed picture of how the number of persistence intervals varies from network to network. This allows us to better differentiate the structure of genealogical networks from social networks as well as observe the structure common to genealogical networks and that common to social networks, respectively.
In Fig. 9 the persistence curves for the unsampled genealogical and unsampled social networks are shown in blue and red, respectively. The atypical genealogical networks are shown in green. The social networks have persistence curves that are quite vertical in both dimension 1 and dimension 2. For dimension 1, this indicates that most cycles in a social network are close to being trivial, either because they have a relatively small circumference or because they can be decomposed into a union of cycles with small circumferences. In particular, most of the social networks have a maximum death time of three (see Definition 2), which corresponds to having a basis of cycles whose maximal circumference is at most nine. In other words, any cycle of circumference ten or more decomposes as the union of smaller cycles. For dimension 2, the steepness of the persistence curves indicates the presence of many distinct, yet similar, paths between certain pairs of vertices. In contrast, the genealogical networks have persistence curves with a much more horizontal profile, indicating that most cycles are quite long and there are fewer 'alternate paths' between pairs of vertices. In the extreme, the atypical genealogical networks are nearly flat in dimension 1, which reflects the fact that these atypical networks were intentionally constructed to have very few cycles. In dimension 2, the atypical networks show a similar slope to most of the typical genealogical networks, but the size of the alternative paths in these networks is much larger. This is likely due to the high number of individuals who were added only to link distant individuals, e.g. presidents. In a typical genealogical network, the additional relationships between such individuals would allow large cycles to decompose, but in the atypical genealogical networks this is not the case.
In Fig. 10, we see the persistence curves for the sampled genealogical and sampled social networks shown in blue and red, respectively. The atypical genealogical networks are shown in green. Again the social networks have persistence curves that are quite vertical in both dimensions, although these curves are not as tall as in the case of unsampled social networks. This indicates that as a social network is sampled it retains a similar proportion of close-to-trivial cycles, but may lose many of the alternative paths between vertices that appear in dimension 2. By contrast, for genealogical networks the persistence curves indicate the complete loss of very large cycles in conjunction with a proportional loss of close-to-trivial cycles. In dimension 2, genealogical networks experience a more severe loss of alternative paths than the social networks. As a result, though sampling shrinks the scale of the persistence curves for social and genealogical networks, they remain visually distinct.
As in the bottleneck distance plots, genealogical and social networks each appear to cluster by type, in that networks of the same type have similar persistence curves. In fact, this is true whether the networks are sampled or unsampled. This suggests that even with incomplete data social and genealogical networks have a distinguishable persistent homology, at least at the scales we consider.
Fig. 9 Comparison of persistence curves for full networks vs sampled networks, grouped by dimension and type of network. Upper Row: Sampling social networks typically stretches the persistence curve in only one axis without affecting the other axis. Lower Row: Sampling genealogical networks typically shrinks the persistence curve in both axes. Overall the average slope for social networks tends to increase when sampled, while genealogical networks experience a decrease in average slope
It is worth mentioning that, while the bottleneck distance plots show us to an extent how different genealogical and social networks are, the persistence curves show us what these differences are. The distance plots in Fig. 8 do have the advantage of simplicity, however, and could presumably be used to more quickly identify differences in networks that are not as apparent as those we find between genealogical and social networks.

Connections
It is also possible to use persistent homology to study properties of a network, such as the number of connected components, the typical size of cycles, or even "missing links" in the data. For genealogical and social networks, we can convert these mathematical concepts into more familiar ideas such as family groups or common ancestors. This also allows us to make conjectures about the persistent homology for such networks by converting standard assumptions about families or social networks into the language of persistence.
In dimension 0, the number of connected components determines the number of [0, ∞) intervals, and the total number of distinct vertices is the number of [0, ∞) intervals plus the number of [0, 1) intervals. In the context of a genealogical network, each connected component represents a family group that is not related to the other family groups by any known connection. Thus, if a given family network is indeed a single "family" of relatives, there should be exactly one [0, ∞) interval. In our Tikopia example we have eight [0, ∞) intervals, each of which corresponds to exactly one connected component of this genealogical network. (Note that Fig. 1 (left) shows only the largest of these components.) In this example, most of the other 'family groups' are actually individuals with no relation edges in the network.
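The dimension-0 bookkeeping described above depends only on the connected components, so it can be computed without any persistent homology software. The sketch below uses a small union-find structure; the function name and the toy network (one family of four plus two unconnected individuals) are hypothetical.

```python
def ph0_interval_counts(vertices, edges):
    """Count the [0, inf) and [0, 1) intervals of PH_0 from connected components."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving keeps trees shallow
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)

    components = len({find(v) for v in vertices})
    return {"[0, inf)": components, "[0, 1)": len(vertices) - components}


# Hypothetical network: individuals a-d form one family, e and f have no recorded relations.
vertices = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]
print(ph0_interval_counts(vertices, edges))
```

Here the two isolated individuals each contribute their own [0, ∞) interval, as happens for most of the small 'family groups' in the Tikopia data.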

Fig. 10 Upper Row: Comparison of persistence curves for full networks by type. Lower Row: Comparison of persistence curves for sampled networks by type, excluding atypical genealogical networks. In each dimension, the average slope for genealogical networks is typically lower than the average slope for a social network. The atypical genealogical networks have the lowest average slope and much greater total length. The behavior for average slopes is more pronounced for sampled networks than for full networks
In social networks, the connected components create what could be referred to as friend groups. Unlike genealogical networks, there are usually few restrictions on which edges form in a social network. As such, we do not have a conjecture about the number of [0, ∞) intervals in this setting in general. However, sampling any network as described in "Data" will result in a new network with a single [0, ∞) interval.
Moving to dimension 1, persistence intervals in this dimension describe the way that each connected component is internally structured. In sufficiently large genealogical networks, we will see three kinds of features that we call common ancestor cycles, union cycles, and hybrid cycles. A common ancestor cycle occurs when two descendants of an individual form a union or have a child together. We use the term union cycle to refer to situations where a cycle is formed through union edges and edges connecting two siblings. The final type of cycle of note, the hybrid cycles, are those formed by any other combination of parent-child edges and union edges, which includes everything that is not a strict common ancestor or union cycle. These three types of cycles are illustrated in Fig. 11, where marriage edges are indicated by red edges and parent-child edges are indicated by blue edges. We show a common ancestor cycle in Fig. 11a. Figure 11b is an example of a union cycle in which two siblings in one family form unions with two siblings in another, where only a single parent in each family is shown. In Fig. 11c we give an example of a θ-cycle, which is the union of a common ancestor cycle and two overlapping hybrid cycles. This example comes from siblings of one family marrying cousins from another family. These cycles can theoretically be of any length, but cultural norms affect the typical size and number of each type of cycle differently. Recording practices and incomplete data also limit whether these cycles appear in a given dataset. Thus having a description of these cycles together with an understanding of the culture may help identify errors in the recorded data. Conversely, understanding the distribution of cycles in high-fidelity datasets can help identify the underlying cultural norms and help extrapolate where individuals are missing in incomplete data sets.
Since many cultures avoid marrying close relatives, common ancestor cycles tend to have a fairly large circumference. In the Tikopia network (see Fig. 1) we see persistence intervals with death values as high as 7 corresponding to cycles with a circumference of at least 21 individuals, which appear to be common ancestor cycles. This partially explains why persistence curves are so flat: there are relatively few minimal common ancestor cycles in a network, but they have very high persistence. More precisely, if the distance to union (the total number of individuals in a common ancestor cycle) is n, then the persistence of that cycle is ⌊n/3⌋. However, the representatives of persistent homology only include a basis for these cycles, instead of including every possible distinct cycle. In particular, a large common ancestor cycle will decompose into the union of two hybrid cycles if the hybrid cycles are each shorter than the common ancestor cycle, as shown in Fig. 11c. Persistent homology will reflect the size of the two smaller cycles instead of the larger common ancestor cycle. We note that it is possible to identify the actual cycles chosen for our basis, but the software we used does not provide that information and the size of the networks prohibits us from identifying the cycles manually.
Fig. 11 Left: A common ancestor cycle. The topmost vertex is a common ancestor of the lowest vertex. The horizontal red line is a marriage, all other lines are parent-child edges. Center: A union cycle, specifically the double cousin situation described in "Background: genealogical and social networks". The left-most and right-most vertices are parents of their neighboring vertices. The two horizontal red lines are marriage edges. Right: A θ-cycle formed by a common ancestor cycle with two overlapping hybrid cycles
In social networks, we see that highly persistent cycles are quite rare. In order to have a cycle of persistence 3, for instance, we need a loop with circumference 9 or higher with no shorter paths between any two vertices in the loop. It may be that phenomena like the small-world effect or, more colloquially, six degrees of separation limit the maximal persistence of social networks. We see this reflected in our example data sets with a maximum persistence of 3 for all but one of the social networks.

Conclusion
In this paper, we explore the persistent homology structure of genealogical networks, motivated by the observation that family links tend to form in a fixed range of intermediate distances, which makes genealogical networks homologically distinct from most other social networks. We also introduce the notion of a persistence curve, which can be used to summarize and compare the persistent homology structure of any network. We also relate specific genealogical structures, such as the common ancestor cycle, to homology objects.
We find that, in the presence of incomplete data, homology analysis is still genealogically useful. We note that missing data due to recording practices and incomplete data (a ubiquitous feature of real genealogical networks) limits the kind of cycles that appear in a given dataset. Thus having a description of these cycles together with an understanding of the culture may help identify errors in the recorded data. Conversely, understanding the distribution of cycles in high-fidelity datasets can help identify the underlying cultural norms and help extrapolate where individuals are missing in incomplete data sets.
There are several interesting directions in which this work could be expanded. For example, our work has made it clear that there is a real need to analyze the persistent homology of large networks, with at least tens of thousands of nodes, since family formation generally takes place at these scales. The Ripser library we relied on was not able to reach these scales. Additionally, we are very interested in creating random graph models which reflect the actual homology of human family networks; a first attempt at this by our group has been fairly successful at the scale of hundreds of nodes (Flores 2021). More broadly, there is a need to model the ground truth human family network. All the extant data sources represent biased, limited, and noisy subnetworks, while the true interest of the genealogical community is in the ground truth network. Tools for signal denoising, image inpainting, and graph extrapolation, for example, could be useful in this context. Finally, an important aspect of genealogical